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Abstract 

Background: RNA-seq, a massive parallel-sequencing-based transcriptome profiling method, provides digital data 
in the form of aligned sequence read counts. The comparative analyses of the data require appropriate statistical 
methods to estimate the differential expression of transcript variants across different cell/tissue types and disease 
conditions. 

Results: We developed a novel nonparametric empirical Bayesian-based approach (NPEBseq) to model the RNA-seq 
data. The prior distribution of the Bayesian model is empirically estimated from the data without any parametric 
assumption, and hence the method is "nonparametric" in nature. Based on this model, we proposed a method for 
detecting differentially expressed genes across different conditions. We also extended this method to detect 
differential usage of exons from RNA-seq data. The evaluation of NPEBseq on both simulated and publicly available 
RNA-seq datasets and comparison with three popular methods showed improved results for experiments with or 
without biological replicates. 

Conclusions: NPEBseq can successfully detect differential expression between different conditions not only at gene 
level but also at exon level from RNA-seq datasets. In addition, NPEBSeq performs significantly better than current 
methods and can be applied to genome-wide RNA-seq datasets. Sample datasets and R package are available at 
http://bioinformatics.wistar.upenn.edu/NPEBseq. 



Background 

The advent of massive parallel sequencing, popularly 
known as Next-Generation Sequencing (NGS), is 
allowing whole genomes and transcriptomes to be se- 
quenced with extraordinary speed and accuracy, provid- 
ing insights into the bewildering complexity of gene 
expression at both gene and isoform levels [1]. With 
decreasing sequencing cost per base, RNA-Seq approach 
has become a desirable method to get a complete 
view of the transcriptome and to identify differentially 
expressed rare transcripts and isoforms [2]. The RNA- 
seq assay provides sensitive and accurate digital counts 
for the exon regions of expressed transcripts in a given 
sample. The count of short sequence reads for each exon 
region is the sum of read counts belonging to the over- 
lapping exon region of different transcript isoforms that 
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are expressed in the sample. Therefore, estimating the 
transcript-level expression from the collection of counts 
of short read sequences that map to exons (or exon 
slices) and exon junctions is a computationally challen- 
ging problem, which has been recently attempted by us 
and others, in programs such as IsoformEx [3], rSeq [4], 
Cufflinks [5], RSEM [6], BASIS [7], and GPSeq [8]. 
However, none of these methods showed good agree- 
ment with qRT-PCR measurements, a gold standard in 
measuring differential RNA abundance between samples 
[3]. The statistical challenges in analyzing RNA-Seq data 
arise from many perspectives. While some sources of 
error are due to inherent problems with the technology, 
some are contributed at laboratory or experimental 
levels, leading to non-biological or technical variation 
across samples. Therefore, there is a critical need for in- 
vestigation of other statistical methods for normalization 
and differential expression analysis of RNA-seq data 
across different conditions. 
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RNA-seq experiments are now frequently employed for 
identifying genes and alternatively spliced gene isoforms 
that are differentially expressed across distinct tissue/cell 
types and disease conditions [9]. This amounts to compar- 
ing one condition, A, with another condition, B, and pro- 
ducing a ranked list of differentially expressed genes 
according to the statistical significance of observed expres- 
sion difference or fold-change between A and B [10,11]. 
Thus, proper normalization between samples is crucial 
before differential expression (DE) analysis and, to a 
certain degree, the two aspects are linked with each 
other. Normalization can be divided into within-sample 
normalization and between- samples normalization [12]. 
DE analysis is the study of the difference in absolute gene 
expression levels between two conditions. However, simi- 
lar to microarray technology, RNA-seq is a relative abun- 
dance measure technology and does not allow for the 
measurement of absolute transcript abundance. This is be- 
cause molecules are sampled proportionately from a large 
pool of cells and the initial number of cells and other tech- 
nical factors are usually difficult to estimate or unknown. 
The standard procedure for computing the proportion of 
sequence reads that map to a gene relative to the total 
number of reads obtained in that RNA-seq experiment 
and for comparing those proportions across different sam- 
ples can lead to high false positive rate. For example, a 
common method for normalization is to divide the gene- 
wise read counts by corresponding gene length and the 
total number of mapped reads to the genome. Recent re- 
ports show that the latter method, based on the total 
count of mapped reads, is not a robust method [13] and 
several alternative methods have been proposed. For ex- 
ample, an empirical strategy that equates the overall ex- 
pression levels of genes between samples under the 
assumption that the majority of them are not DE was pro- 
posed recently [10,14,15]. Alternatively, the widely used 
quartile normalization method in the microarray field was 
also adapted for between-sample normalization of RNA- 
seq [16]. A recent review evaluated seven proposed 
normalization methods for the differential analysis of 
RNA-seq data by using a varied group of real and simu- 
lated datasets involving different species and experimental 
designs [13]. They concluded that the methods proposed 
in DESeq [17] and edgeR [18] have the most relative satis- 
factory behaviour compared to the others. 

Similarly, several tools have been developed for DE ana- 
lysis of RNA-seq data. The Poisson model has been suc- 
cessfully used to account for technical variations in 
RNA-seq data [4]. When biological replicates are available, 
the negative binomial distribution is commonly used to 
model the over-dispersion in the count data, such as 
DESeq and edgeR. There are also pure non-parametric 
methods, which do not assume any particular distribution 
for the data, e.g. NOISeq [11]. Approaches within a 



Bayesian framework for differential expression in RNA- 
Seq data have also been developed by many researchers, 
such as baySeq [19], GFOLD [20], ShrinkSeq [21] and 
EBSeq [22]. It is acknowledged that Bayesian approach 
can be used to obtain accurate and robust estimates by 
sharing information across all genes when sample size is 
small [23]. In baySeq, the genes are ranked according 
to the posterior probabilities of differential expression 
between conditions, using an empirical Bayes framework. 
To infer the posterior probability, the gene expression 
prior factors are integrated out by an approximation 
method [24] with mean and dispersion parameters empir- 
ically and iteratively estimated from the entire set of genes 
through a quasi-likelihood method. GFOLD assumes uni- 
form prior distribution (vague prior) of gene expression 
level for technical replicate model. For data with biological 
replicates, a hierarchical model with log-normal prior for 
the gene expression is used to account for the biological 
variation. The posterior distribution of fold change is 
obtained through sampling. ShrinkSeq, which also takes a 
Bayesian perspective, is presented in a framework of gen- 
eralized linear model setting, which infers the DE coeffi- 
cient in the GLM directly instead of inferring the gene 
expression level first. ShrinksSeq explores both parametric 
mixture prior and non-parametric prior for the DE coeffi- 
cient and extends the INLAs (integrated nested Laplace 
approximation) method to infer the marginal posterior 
distribution under non-parametric prior and shows the su- 
perior performance of non-parametric prior than paramet- 
ric prior. EBSeq, similar to baySeq, ranks the genes/ 
isoforms by posterior probability of DE, but assumes a 
parametric form of the prior distribution for the gene/iso- 
form expression with parameters estimated from the data 
by method of moments and EM. All the aforementioned 
methods do not provide the close-form of posterior distri- 
bution of fold change. Because sequencing of cDNA reads 
is basically a sampling procedure, it is important to note 
that a large number of genes are unseen in a typical RNA- 
seq sample due to low expression or the limited depth of 
the experiment. For example, only approximately 0.0013% 
of the total number of available molecules in a RNA library 
are sampled in one lane of a typical Solexa/Illumina GAIIx 
RNA-seq experiment [25]. Further, the fact that a small 
number of highly expressed genes consume a significant 
fraction of the total sequence reads can also influence the 
statistical inference procedures. These limitations affect the 
estimation of DE or differences in relative transcript distri- 
bution between samples. For almost all currently developed 
RNA-seq DE methods, genes with low read counts are 
usually omitted from the analysis because of unreliable es- 
timation. Another issue is that a zero read count in one 
condition leads to unrealistic estimation of fold change. 

Here, we developed a novel method to model the RNA- 
seq data and detect differentially expressed genes and 
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exons across different conditions. To mitigate the biases 
caused by the nature of sampling and reliably estimate the 
expression levels of those unseen and lowly expressed 
genes, we adopted a previously developed Poisson mixture 
model to empirically estimate the prior distributions of 
read counts completely from the data [26]. We propose a 
nonparametric, empirical Bayesian-based approach to 
model the RNA-seq data. We prepared five datasets, three 
simulated and two publicly available RNA-seq datasets, 
for systematically evaluating the performance of the new 
method. Also, the novel method is compared with the 
other popular methods for RNA-seq DE analysis, both 
using simulated and real RNA-seq datasets. 

Implementation 

A few of the earlier RNA-seq assessment studies have 
reported highly reproducible results with little technical 
variation [27,28], suggesting that the inclusion of technical 
replicates in the experimental plan is usually not essential. 
Numerous RNA-Seq studies have used the Poisson model 
to perform testing for differential gene expression. The 
Poisson model assumes equality of mean and variance of 
read counts per gene across replicates. Therefore, pooling 
technical replicates together to give read counts for each 
biological replicate does not lead to loss of information. 
Thus, we first discuss one replicate per condition and then 
consider biological replicates. 

Model for single replicate 

Let y be the expression level of one gene under one condi- 
tion and x be the number of observed reads mapped to 
this gene. It is well known that x follows a binomial distri- 
bution and can be approximated well by a Poisson distri- 
bution with mean X = ydl, where 1 is the gene length and 
d is the normalization constant reflecting the sequencing 
depth. Given a prior mixing distribution G (with probabil- 
ity density function g(X)) on X, the posterior distribution of 
X is g(A)^/h G (x), where h G (x) = jA*/x ! e~ A dG(A) is a 
G -mixture of Poisson. 

A gene is expressed if, and only if, x > 1. Conditioning 
on x > 1, x follows a Q-mixture of zero-truncated 
Poisson h G (x)/(l - h G (0)) or a mixture fq(x) of trun- 
cated Poisson, where 



f Q (x) 



h G (x) 
l-h G (0) J x!(e x -1) 
l-e- x )dG(\) 



dQ(X), 



(1) 



dQ(X) 



/(l-e-n)dG(n)' 



Let n x denote the number of genes with exactly x reads 
in the sample. The conditional nonparametric maximum 
likelihood estimator Q for Q is Q = argmax Z n x logf 0 (x) , 

x>l V 

whose calculation is discussed in [29,30] and the calculation 



details under the context of RNA-seq are provided in the 
Additional file 1. There is a one-to-one mapping between 
G and Q from equation (1). The posterior distribution of X 



is then given by X|x~g(X) /h^(x). An empirical Bayes 



estimator for X is 



E(A|x) 



(x+l)h 6 (x+l) 



(2) 



Let X A and X B denote the read counts that represent 
the true expression level of a gene and G A and G B de- 
note the corresponding prior distributions, under con- 
ditions A and B, respectively. However, as mentioned 
previously, since NGS is like sequencing a set of sam- 
pled reads from a pool of expressed sequences of gene, 
the read counts that are obtained, say x A and x B , de- 
note the corresponding reads counts obtained in con- 
ditions A and B. The posterior distribution of log 

(d^j^j, which is log fold change of expression level of 
a gene, has a closed-form formula and is easy to derive, 
because Ga and Gg follow probability distribution of 
discrete form. 

The normalization constant d can be inferred from 
some previous available methods, for example the 
methods proposed in DESeq or PoissonSeq [31]. This 
can also be calculated based on the assumption that the 
expected values of log-fold change of the majority of 



genes are zeros, 



E [ lo s( d vi^)] 



' Ab|x b 



(3) 



Thus, we rank the genes by the values of E[log(^j^)] 

first and then estimate d by using the genes falling in the 
(e, 1 - e) quantile of all those values. In this paper, we used 
e=0.25. That is, we used half of the genes to estimate d. 

NPEBseq tests the hypothesis that the difference in the 
gene expression level between conditions A and B is 
above a user-defined cutoff A, i.e., the probability that 



, ,,Xa|x A , 

log(c W 



> A 



(4) 



The default value for A is log(2). We consider this as 
our own pre-defined p-value. The false discovery rate is 
controlled with Benjamini-Hochberg adjustment. 

Model dealing with biological replicates 

RNA-seq datasets with large numbers of biological repli- 
cates are increasingly generated by many laboratories 
and consortia, for example, HapMap [32], ENCODE 
[33], and TCGA projects [34]. TCGA data consists of 
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hundreds of RNA-seq biological replicates for each can- 
cer condition. Dealing with the large number of bio- 
logical replicate data is challenging. Recent studies have 
found that while the Poisson model is appropriate for 
technical replicates of the same RNA samples, it can be 
a poor fit for biological replicates. Here we propose a 
hierarchical Bayesian model to account for the over- 
dispersion in the read counts. 

Let c denote the number of biological replicates for 
one condition and we assume that 

Xjj ~ Poisson (dj eij ) , 

eij ~ Gamma(Xi, 0), with mean = Xi and variance = Xi6, 
Xi ~ G with g(X) denoting the pdf of G, 

where Xij is the number of reads for gene i and replicate 
j; e^ is the expression index; Xi is the expression level of 
gene i under this condition A; 6 is the scale parameter 
of Gamma distribution; and dj is normalization constant 
for replicate j. The prior distribution G is inferred as be- 
fore by using the sample that has the largest data depth 
under each condition. 

Here we are interested in inferring the posterior distri- 
bution of fold change for each gene, in which the 1 will 
be cancelled out, so simply letting 1=1 does not 
change the calculation. Let x t = (xii,x i2 , ...,x ic ) and 

= (en,ei2, ...,ei c ). Based on the aforementioned model, 
the joint posterior distribution of e^, Xi is given by 



X b ei|xi~g(Xi) n 



£ 1 



e>/^ e 



r(Xi/e)e t 



g(*0.n 



£ 1 



.Xij+Xi/e-i 



4 



j J d Xi j 



r(x i /e)e t 



It is known that marginal conditional ei|(xi,Xi) follows 
Gamma distribution and can be easily integrated out 
(further details are given in the supplementary methods 
sections). Thus, the log transformed marginal posterior 
distribution of \ is given by 



logfXil XiVlogteM) + EEiogfi 

j=l k=l V 

+ io g e ^ x^- ^(x^ + Xi/e)iog(dj6 + 1) 



The p- values and FDR can be computed by equation (4). 

Empirical Bayes methods have been used to estimate 
the degrees of over-dispersion in the data. Based on our 
hierarchical model, we also propose an empirical Bayes 
method to estimate the dispersion parameter 6. It is 



known that the conditional variable Xij|Xi follows nega- 
tive binomial distribution (mixture of Poisson with 
Gamma prior) and the expected value and variance of it 
are given by 

E[xij|Xi] = djXi Var[xij|Xi] = Xi(l + edj)dj. 

Although the marginal distribution of x^ is unclear, 
the expected value and variance can be computed in the 
following ways: 

E[xij] =E[E[xij|Xi]] =djE[Xi] 
Var [x^] = Var [E |Xj ] + E [Var [ Xij \\] ] 
= VarfdjXi] + (l + edj)djE[Xi] 

So we estimate 6 by, 

VarfxjjJ-d^Varpli] 1 
djE[xij] dj' 

Similar to the estimation of G, 0 is also estimated by 
using the sample that has the largest data depth under 
each condition. 

Differential exon usage analysis from RNA-seq data 

RNA-seq also provides information for the study of al- 
ternative splicing. DE analysis of individual transcripts is 
essential in many comparative studies because of 
isoform-level changes in gene expression between condi- 
tions [9]. Recently, two tools, Cufflink [35] and BitSeq 
[36], have been proposed to identify differential expres- 
sion of transcripts by first estimating the expression of 
the transcripts. The expression or abundance estimates 
may contain significant correlated uncertainties that re- 
duce the power for inference of differential expression 
[37]. Another tool, DEXSeq [37], proposed an exon- 
centric analysis to test for differential exon usage in 
RNA-seq data based on a generalized linear model. The 
input of DEXSeq is a table that contains read counts for 
each exon of every gene in each sample. Note that one 
exon may be cut into two or more parts if its boundary 
is not the same in all transcripts. The basic unit for 
counting the number of reads overlapped is called 
"counting bin" in this manner, similar to the definition 
of exon slice used in IsoformEx algorithm [3]. 

DEXSeq tests if each counting bin is differentially used 
between conditions. Inspired by this, we propose a 
method to detect different exon usage based on our 
Bayesian hierarchal model. Assuming that a gene is 
expressed under two different conditions, A and B, let 
t A k and t B k denote the expression level of counting bin k 
of this gene and t Ak and t B i< denote the observed read 
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counts overlapping with it. The differential exon detec- 
tion method involves the following steps: 

1. The posterior distribution of t A i<|yAk and t B k|y B k is 
derived based on our model applied to counting bin 
read count data. 

2. Test the fitness of the distribution against the null 
hypothesis: the proportion of the number of reads 
overlapping with a counting bin to that of all the 
reads overlapping with the gene does not change 
between conditions (same as in DEXSeq). 

3. Finally, define the p-value as the probability of 

| lo gTO- lo gTO| > A ' where A denotes a 
user-defined cutoff that represents the extent of 
differential expression one wishes to identify 
between conditions. For example, if A is set as log 
(1/3), it will test if the exon inclusion level of an 
exon is less than 1/3 or more than a three-fold 
change between conditions. Such an extreme switch 
of exon usage between conditions is a strong 
indicator of functional alterative splicing events. 

Within-condition quantile normalization 

For cases without replicates, the normalization constant 
can be computed by equation (3). With biological repli- 
cates, dj can be computed by the method proposed 
in DESeq or the trimmed mean of log-folder change 
method [10]. Here we propose within-condition quantile 
normalization based on the assumption that the distribu- 
tions of read count data within conditions are common. 
The samples in the same condition are first quantile nor- 
malized relative to the one that has the largest data depth 
and then equation (3) is applied to do the normalization 
between conditions. Quantile normalization of read count 
data for samples coming from different conditions might 
not be proper due to the fact that longer genes have more 
reads and they might be differentially expressed between 
conditions, which can put the gene expression level in a 
different scale within one sample. We compared this 
within-condition quantile normalization to the method 
proposed in DESeq for the simulated data and no signifi- 
cant difference was present (results are not shown here). 
For rest of this paper, we adopt the within-condition quan- 
tile normalization procedure. 

Datasets for simulation 

We generated three simulation datasets to evaluate the 
proposed method for identifying differentially expressed 
genes. Datasetl is generated with biological replicates by 
assuming different priors across conditions, which 
means G A and G B follow different statistical distribu- 
tions. Dataset2 consists of no replicates and is generated 



by following the simulation scheme adopted by baySeq 
and edgeR. Dataset3 is generated by the same scheme as 
dataset2 but with biological replicates. 

Results 

NPEBseq method 

NPEBSeq is a nonparametric empirical Bayesian-based ap- 
proach to model the RNA-seq data. The expression level 
of genes with low read counts is estimated by borrowing 
information from the gene expression in the whole sam- 
ple. The non-parametric form of the prior distribution 
avoids any unrealistic assumption. The parametric as- 
sumption for the prior distribution is usually not fulfilled 
for the RNA-seq read count data. The fact that there are 
many genes expressed at low levels in one sample is illus- 
trated in Figure 1, which is generated based on one sample 
from Marionis RNA-seq dataset [27]. This plot clearly 
shows that a large proportion of genes in a sample are 
expressed at low levels. These genes could have a high im- 
pact on the performance of statistical methods to identify 
differentially expressed genes. The fact that there are large 
numbers of genes/transcripts with low read counts and a 
small number of genes with a significantly high number of 
reads make any parametric assumption for the prior distri- 
bution unrealistic. 

Performance of NBESeq on simulated datasets 

To evaluate the proposed method for identifying differ- 
entially expressed genes, we first conducted a simula- 
tion study. 

Simulation 1 - Simulation with different priors between 
conditions (datasetl) 

To generate data similar to those produced by real RNA- 
seq experiments, we first applied the empirical Bayes 
method on publicly available RNA-seq datasets, which were 
generated to compare liver and kidney transcriptomes [27]. 
The prior distributions of kidney and liver samples were 
first estimated and then the data was normalized based on 
the expected values. The corresponding dispersion param- 
eter 6 for each condition was also estimated. 

Datasetl consist of 20 independent simulations with 
seven samples each for two conditions. The library 
size of each sample is uniformly sampled from 300,000 
to 900,000. Each sample was generated by a mixture 
of negative binomial model with both the prior distribu- 
tions and dispersion parameters estimated from Marionis 
data. Each sample consists of 10,000 genes for computa- 
tional efficiency. 

We performed a comparative analysis of our method 
with four popular methods, DESeq, edgeR, baySeq and 
NOISeq, which are available as part of Bioconductor 
packages at http://bioconductor.org [38]. The edgeR im- 
plements two ways to estimate the dispersion parameter 
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Figure 1 Distribution of number of observed reads per gene for genes with read count less than 1000. The number of genes in a RNA- 
seq dataset is shown in relation to number of mapped reads per gene. X-axis: number of observed reads per gene; Y-axis: frequency of genes. 



in its model, common dispersion and tag-wise disper- 
sion. Both of them are studied here. baySeq provides 
two choices of model (Poisson and negative binomial). 
We adopted the negative binomial model for dataset 1. 
Both DESeq and edgeR provide p-values for ranking the 
genes. baySeq provides log posterior likelihood ratio for 
ranking the differential expression of genes. In the case 
of NPEBseq, we rank the genes by p-values as defined in 
equation (4). The purpose of this simulation is to com- 
pare the ability of these methods to rank the genes in 
order of differential expression. The true ranking order 
of the genes is based on the fold change of differential 
expression values between the two conditions. 

We used the following criteria to compare the per- 
formance of different methods. Given a cutoff point 
t (e.g. the number of genes declared significantly 
expressed), the efficiency of a statistical method is mea- 
sured by p T , the expected percentage of the true first t 
DE genes being correctly declared as the first t DE 
genes. The average of estimated p T is calculated from 
the 20 replicates. The simulation results for datasetl are 
shown in Figure 2. The proposed NPEBseq method 
outperformed other methods. 

Simulation 2 - simulation with the same priors between 
conditions (dataset2 and dataset3) 

A simulation scheme similar to the one suggested by 
Robinson and Smyth [39] is applied here to generate 



dataset2 and datasetB. The library size of each sample 
was uniformly sampled from 300,000 to 900,000. The 
prior distribution of \ was assumed to be common be- 
tween the two conditions and estimated from the liver 
RNA-seq data of Marioni. 

Dataset2 was generated by Poisson distribution and 
dataset3 by negative binomial distribution, with the dis- 
persion parameter estimated from the liver data. The 
simulated data consists of 10,000 genes, and one-tenth 
of those genes were set to be differentially expressed 
(between condition A and condition B) with X A = b\ B - In 
order to produce both over- and under-expression in 
our simulated data, 500 randomly selected genes were 
set to have b=4 and the remaining 500 genes were set to 
have b=l/4. Both dataset2 and dataset3 consist of 20 in- 
dependent simulations. Dataset2 was generated without 
replicates. Similar to datasetl seven samples per condi- 
tion per simulation were generated for dataset3. The full 
ROC curves for dataset2 and dataset3 are shown in 
Figures 3 and 4, respectively. Based upon examination of 
these curves, the proposed NPEBseq method appears to 
perform better than the other methods. The partial ROC 
curves with false positive rate less than 0.2 are shown in 
Additional file 2: Figure SI and Figure S2, which indi- 
cate that NPEBSeq performs as well as the other 
methods. To clearly show that NPEBSeq can robustly 
estimate fold change of genes with low read counts, the 
estimated fold change of 10 genes from one sample of 
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Figure 2 Simulation results of comparing the performance of DESeq, edgeR and NPEBseq on datasetl. The x axis denotes t and y axis 
denotes p T . 
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Figure 4 ROC curves based on simulated dataset3. The programs evaluated are: DESeq, edgeR, baySeq, NPEBseq and NOISeq. 



dataset2 by NPEBseq along with DESeq and edgeR are 
shown in Table 1. For the cases with zero read count 
under one condition, DESeq always gives infinite esti- 
mation of fold change. 

Real RNA-seq data analysis 

To further evaluate our method, we tested it on two 
published RNA-seq datasets. 

Real RNA-seq data 1-Comparison based on one MAQC 
dataset 

We first applied NPEBSeq on the Micro Array quality 
control (MAQC) dataset [40,41] and compared with 
DESeq, baySeq, and edgeR. MAQC datasets contain 
gene expression data from multiple platforms and are 
extensively used in evaluating different data processing 



methods. We downloaded the MAQC2 Illumina RNA- 
seq data from http://www.ncbi.nlm.nih.gov/sra, which 
contains seven technical replicates of brain reference 
RNA samples and seven technical replicates of UHR 
RNA samples. Tophat [35] was used for tag alignment 
and counts for each gene were computed by means 
of HTSeq Python package (http://www-huber.embl.de/ 
users/anders/HTSeq/), using the annotation of the 
Ensembl genes and only reads that mapped to exons. 

As part of the original MAQC project, around 1,000 
genes were also chosen to be assayed by Taqman qRT- 
PCR. Those qRT-PCR data were obtained from GEO 
database, which contains four technical replicates for 
each of the two samples. The qRT-PCR data were used 
as a gold standard to benchmark the gene expression 
values by RNA-seq. We analysed the qRT-PCR data 



Table 1 Estimated fold change of 10 genes from one sample of simulated dataset2 using NPEBSeq, DESeq and edgeR 



geneJD 


9995 


9996 


9511 


9032 


9045 


9030 


9082 


3 


1 


Poisson mean under condition A 


0.7741 


1 .4868 


11.5416 


424.1334 


5.2419 


5.2419 


112.1307 


0.7741 


86.6691 


Poisson mean under condition B 


3.0318 


5.8233 


45.2038 


103.8228 


1 .2832 


1 .2832 


27.4483 


0.758 


84.8622 


TRUE fold change 


4 


4 


4 


4 


4 


4 


4 


1 


1 


observed #reads under condition A 


0 


0 


2 


365 


34 


13 


89 


0 


80 


observed #reads under condition B 


4 


2 


37 


111 


0 


1 


83 


4 


72 


estimated fc by NPEBSeq 


2.5813 


1 .8077 


13.2633 


3.3515 


18.9702 


5.8962 


1.1168 


2.5813 


1.0156 


estimated fc by DESeq 


inf 


inf 


18.6932 


3.2543 


inf 


12.8656 


1.0612 


inf 


1 .0996 


estimated fc by edgeR 


60.4244 


24.121 


58.9655 


5.3961 


1192.127 


29.3642 


1 .0736 


60.4244 


1.1299 
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using the comparative Q methods [42]. Finally, 407 
genes were defined as DE and 119 genes were defined as 
non-DE. Given the fact that not all the genes were 
assayed by qRT-PCR, we followed the same procedure 
that was applied in [15] to define the true positive and 
false positive rates. Given a "DE" or "non-DE" call from 
qRT-PCR, define a true positive (TP) as the event that 
the test of interest calls a gene DE that qRT-PCR called 
DE. A false positive (FP) event occurs when the test calls 
a gene DE that qRT-PCR called non-DE. The true posi- 
tive rate (TPR) is defined as 

(#TP and qRT-PCR is DE)/ (total # genes) 
#qRT-PCR is DE/ (total # genes) 

and the false positive rate (FPR) is defined as 

(#FP and qRT-PCR is non-DE)/ (total # genes) 
#qRT-PCR is non-DE/ (total # genes) 

Note that these are not the standard definitions of 
TPR and FPR. 

qRT-PCR data were annotated by RefSeq. The BioMart 
R package [43] was used to convert the RefSeq genes 
IDs for qRT-PCR to Ensembl genes ids. 

The ROC curves from all the compared methods are 
shown in Figure 5. Clearly, our proposed method has the 
best performance in terms of sensitivity and specificity. 



Real RNA-seq data 2-Detecting differential usage of exons 
from RNA-seq data 

We also analysed the data by Brook et al. [44], where the 
effect of the RNAi knockdown of "pasilla" was studied 
by RNA-seq in the Drosophila melanogaster cell line. 
The data was downloaded as part of DEXSeq package. 
The data consists of four control samples and three 
knockdown samples. The analysis at gene level by 
NPEBseq reported 107 differentially expressed genes, 
with nominal FDR control at 0.1 for the comparison of 
control and knockdown. To access the specificity of the 
NPEBseq method we performed in-condition compari- 
son by making use of the fact that there are four bio- 
logical replicates in the control group. We applied 
NPEBseq for the comparison of two control samples ver- 
sus the other two. NPEBseq reported zero differentially 
expressed genes with FDR control at 0.1, which indicates 
that NPEBseq has a very high specificity. 

We then analysed Brook s data at exon level. NPEBseq 
found differential exon usage for 2,370 counting bins at 
FDR 0.01 for between-condition comparison and 225 
counting bins for within- condition comparison. We also 
applied the newest version of DEXSeq on the exon data, 
which reported 120 counting bins as DE at FDR 0.1. We 
checked whether NPEBseq and DEXSeq could achieve 
comparable results by computing the percentage of DE 
called exons that are common in the two ranked lists 
of exons generated by both programs. The results are 



d 




DESeq 

NPEBseq 

edgeR-comm 

edgeR-tag 

baySeq 

NOISeq 




FPR 

Figure 5 ROC curves based on MAQC2 real RNA-seq data: Comparison of the performance of DESeq, edgeR, baySeq, NPEBseq and 
NOISeq methods. We declared non-DE if its RT-qPCR absolute log-ratio was less than 0.2 and DE if its absolute log-ratio was greater than 2.0. 
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The top number of counting bins rank by NPEBseq and DEXSeq 

Figure 6 Percentage of DE exons that are common in the two ranked lists of exons generated by NPEBseq and DEXSeq programs. 

While x-axis denotes the number of declared DE exons and the y-axis denotes the percentage of common calls between both the programs. 



shown in Figure 6. For example, we found that 74 
counting bins (exons) were common among the top 120 
DE counting bins called by each approach. And, further 
examination revealed that among the top 120 DE 
counting bins identified by NPEBseq, 12 were defined as 
"untestable" by the DEXSeq method due to low read 
counts in those counting bins. Since the p-value defined 
in NPEBseq is different from the regular p-value, we 
didn't expect these two approaches to report similar 
number of DE exons at the same FDR level. 

Discussion 

In this paper we developed a novel empirical Bayesian- 
based approach to model the RNA-seq data. This 
method has been widely used in ecology to estimate spe- 
cies diversity [26]. The nonparametric form of the prior 
distribution of the Bayesian model is empirically esti- 
mated from the data. The expression level of genes with 
low read counts are estimated by borrowing information 
from the gene expression in the whole sample. For data 
with biological replicates, we developed a hierarchal 
Bayesian model to account for the over-dispersion and 
proposed an empirical Bayesian method to estimate the 
dispersion parameter. We also extended the model 
to detect differential usage of exons from RNA-seq 
datasets. The closed-form formula of the posterior distri- 
bution makes the computation of any statistics very effi- 
cient. At the final step, we evaluated the performance of 



this method in detecting the differentially expressed genes 
by conducting simulation and real RNA-seq data analysis. 

There are many challenges still present in the process- 
ing and analyses of RNA-seq data. For example, it has 
been empirically observed that quantification of expres- 
sion depends on the length of the biological features 
under study (genes, transcripts, or exons), as longer fea- 
tures tend to have more significant statistics than shorter 
ones [45]. Also, it was recently shown that there exists a 
sample-specific guanine-cytosine content (GC-content) 
effect and the studies proposed normalization methods 
by GC-strata to remove such effects [46,47]. Incorporat- 
ing those factors into our model could further improve 
the performance. 

Delineating the gene expression at an alternative 
transcript-level from RNA-seq data is still a very challen- 
ging problem. Our recently published IsoformEx method 
[3], based on non-negative least square, is aimed to esti- 
mate transcript abundance. In future enhancements to 
the proposed method we will integrate NPEBseq with 
IsoformEx to identify DE at isoform-level. 

Conclusions 

NPEBseq can be applied to not only detect differential 
gene expressions from the RNA-seq dataset with tech- 
nical and biological replicates for both studied condi- 
tions, but also to detect differential usage of exons. It is 
robust, since it requires no limited assumptions to be 
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made about the prior distribution of the data. NPEBseq 
also provides the closed form of posterior distribution of 
the fold change, which is useful for further analysis. 

Availability and requirements 

Project name: NPEBseq 

Project home page: http://bioinformatics.wistar.upenn. 
edu/NPEBseq 

Operating system and R version: The R package is 
platform independent and is compatible with all the ver- 
sions of R same as or higher than 2.15.1. 
Other requirements: No. 
License: GPL (> 2) 

Any restrictions to use: It is available for free download. 
Additional files 



Additional file 1: This file describes the procedure to derive the 
marginal posterior distribution of Aj and the procedure to infer the 
prior distribution of our proposed model. 

Additional file 2: Figure SI. Partial ROC curves based on simulated 
dataset2. Figure S2 Partial ROC curves based on simulated dataset3. 
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